Draft ADR for modular observability pipeline #1665

Draft

cjnolan wants to merge 7 commits into main from modular-observability-adr

Contributor

cjnolan commented Mar 31, 2026

Description

Please include a summary of the changes and the related issue. List any dependencies that are required for this change.

Fixes # (issue)

Any Newly Introduced Dependencies

Please describe any newly introduced 3rd party dependencies in this change. List their name, license information and how they are used in the project.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Checklist:

I agree to use the APACHE-2.0 license for my code changes
I have not introduced any 3rd party dependency changes
I have performed a self-review of my code

cjnolan added 4 commits

March 31, 2026 17:05


          Draft ADR for modular observability pipeline

2b2895a


          Fix linter issue


          Add Implementation Steps stage

26cf0b3


          Add sequence diagram for modular workflow

91ce973

jkossak reviewed


design-proposals/observability-modular-hw-metrics-collection.md (six outdated comment threads, all resolved)

cjnolan added 3 commits

April 1, 2026 15:41


          Address review comments

202e5d7


          linter fix

9f350d7


          fix linter issue

4d6beaa

palade reviewed


Contributor

palade left a comment

Left a few comments

design-proposals/observability-modular-hw-metrics-collection.md

+              To support collection of additional HW metrics from GPU, PMU, cache utilization, etc., the current POA implementation
+              will be expanded to include new metrics collectors for these HW components. Also, modifications will be made to
+              the Edge Node Observability pipeline deployment in the orchestrator to allow it to be deployable as a standalone
+              pipeline without requiring other components from the EMF stack.

Contributor

palade Apr 3, 2026

without requiring other components from the EMF stack

Which are these components?

design-proposals/observability-modular-hw-metrics-collection.md

+              - **BIOS Metrics**: One option for these metrics is to use the [Telegraf redfish collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/redfish)
+                to retrieve thermal and power settings.
+              ##### New Metrics to Configure and Enable

Contributor

palade Apr 10, 2026

Each of these will require additional permissions to be enabled on the device side and will likely increase resource utilization. Is there any QoS currently enabled to support existing and additional metrics collection? e.g., best-effort collection?

design-proposals/observability-modular-hw-metrics-collection.md

+              Instead the Orchestrator Command Line Interface (CLI) tool will be extended to provide commands for a user
+              to run to query the Mimir backend for metrics.
+              The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as

Contributor

palade Apr 10, 2026

To clarify this, is the query going to the orchestrator or to the edge node? Or, if the requested data is not found on the orchestrator side, will the query then be sent to the edge node?

design-proposals/observability-modular-hw-metrics-collection.md

+              to run to query the Mimir backend for metrics.
+              The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as
+              any time range required by user. If a time range is not provided, then the CLI should use a default time range,

Contributor

palade Apr 10, 2026

This concern may already be addressed, but how is clock synchronization ensured across the orchestrator and edge devices, so that the requested time range is the same on every device and there is no offset?

design-proposals/observability-modular-hw-metrics-collection.md

+              The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as
+              any time range required by user. If a time range is not provided, then the CLI should use a default time range,
+              such as the last 5 minutes. The CLI should also support retrieving both averages and sums for metrics over set time
+              periods.

Contributor

palade Apr 10, 2026

Is the request made for a single node or for all the nodes? How does this scale?
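
For reference, the query described in the quoted lines (a metric, an edge node, an optional time range defaulting to the last 5 minutes, and a choice of average or sum) could map to PromQL roughly as follows. This is a minimal sketch for a single node only. The `build_promql` helper, the `host` label used to identify the edge node, and the metric name in the usage note are assumptions for illustration and do not come from the ADR.

```python
import re
from datetime import timedelta

# Default taken from the ADR text: "the last 5 minutes".
DEFAULT_RANGE = timedelta(minutes=5)

# The two aggregations named in the ADR, mapped to PromQL range functions.
AGGREGATIONS = {"avg": "avg_over_time", "sum": "sum_over_time"}

# Valid Prometheus metric names.
METRIC_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def build_promql(metric, node, agg="avg", window=None, node_label="host"):
    """Build a PromQL expression for one metric on one edge node.

    `node_label` is a hypothetical label name; the real label depends on
    how the pipeline tags samples from each edge node.
    """
    if not METRIC_NAME.match(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    if agg not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {agg!r}")
    if '"' in node or "\\" in node:
        raise ValueError("node name must not contain quotes or backslashes")

    # Fall back to the default window when the user gives no time range.
    window = DEFAULT_RANGE if window is None else window
    seconds = int(window.total_seconds())
    if seconds <= 0:
        raise ValueError("time range must be positive")

    func = AGGREGATIONS[agg]
    return f'{func}({metric}{{{node_label}="{node}"}}[{seconds}s])'
```

With no time range given, `build_promql("cpu_usage_idle", "edge-node-01")` yields `avg_over_time(cpu_usage_idle{host="edge-node-01"}[300s])`. Querying all nodes at once would need a different selector and is not covered by this sketch.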

design-proposals/observability-modular-hw-metrics-collection.md

+              periods.
+              Within the CLI, it should convert the received query into the PromQL format needed for querying Mimir
+              and then send the PromQL query to the Mimir API. When the CLI receives the metrics back from Mimir,

Contributor

palade Apr 10, 2026

What happens if, for some reason, metrics collection fails or becomes unavailable during the requested time range? What if the data is only partially available?
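
For the partial-data case, one option is for the CLI to report how much of the requested window actually contains samples. This is a minimal sketch. It assumes the result arrives in the Prometheus-compatible range-query shape (a list of `[timestamp, "value"]` pairs per series) and that the query step is known. `sample_coverage` is a hypothetical helper and is not part of the ADR or the existing CLI.

```python
def sample_coverage(values, start, end, step):
    """Return the fraction of expected samples present in [start, end].

    `values` is a list of [unix_timestamp, "value"] pairs for one series.
    `start`, `end`, and `step` are in seconds. The result is an estimate:
    it counts distinct timestamps in the window and does not check that
    they fall exactly on the step grid.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if end < start:
        raise ValueError("end must not be before start")

    # A range query evaluated every `step` seconds yields this many points
    # when no data is missing.
    expected = int((end - start) // step) + 1

    seen = {float(ts) for ts, _ in values if start <= float(ts) <= end}
    return min(len(seen) / expected, 1.0)


def describe_coverage(values, start, end, step):
    """Summarise coverage as a short status string for CLI output."""
    ratio = sample_coverage(values, start, end, step)
    if ratio == 0:
        return "no data in requested range"
    if ratio < 1:
        return f"partial data: {ratio:.0%} of expected samples"
    return "complete"
```

A CLI could print the returned status next to the average or sum, so that a value computed from half a window is not mistaken for a full one.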

design-proposals/observability-modular-hw-metrics-collection.md

+                - Provide documentation on how to install the modular observability workflow.
+                - Extend Orchestrator CLI documentation with new commands for metrics querying.
+              ## Opens

Contributor

palade Apr 10, 2026

Has support for multi-vendor environments been considered?

design-proposals/observability-modular-hw-metrics-collection.md

+              ## Implementation Plan
+              - Hardware Metrics Collection.
+                - Identify the new hardware metrics collectors to be added to the current edge node metrics service.

Contributor

palade Apr 10, 2026

What are the performance implications when additional metrics are collected, both on the orchestrator side and across the network?

design-proposals/observability-modular-hw-metrics-collection.md

+                pipeline and can be used to configure what metrics an edge node reports after it has been deployed without
+                requiring a full redeployment or access to the edge node. For modular deployments, should this also be included
+                and used for this purpose or should it be excluded?
+              - Investigate the [Intel Performance Counter Monitor(PCM)](https://github.com/intel/pcm) tool as there may be

Contributor

palade Apr 10, 2026

What happens if the hardware metrics collection fails? For example, because of a hardware malfunction, a misconfiguration, or a disruption to the sensors taking readings.


Reviewers

palade left review comments

jkossak left review comments

Awaiting requested review (each is a code owner and will be requested when the pull request is marked ready for review): johnoloughlin, garyloug, rranjan3, krishnajs, scottmbaker, damiankopyto, SushilLakra, soniabha-intc, guptagunjan, sys-orch-approve, Ram-srini, shankarsrinivas1, sunil-parida

At least 1 approving review is required to merge this pull request.

Labels

None yet